A High-Performance Semi-Supervised Learning Method for Text Chunking
نویسندگان
چکیده
In machine learning, whether one can build a more accurate classifier by using unlabeled data (semi-supervised learning) is an important issue. Although a number of semi-supervised methods have been proposed, their effectiveness on NLP tasks is not always clear. This paper presents a novel semi-supervised method that employs a learning paradigm which we call structural learning. The idea is to find “what good classifiers are like” by learning from thousands of automatically generated auxiliary classification problems on unlabeled data. By doing so, the common predictive structure shared by the multiple classification problems can be discovered, which can then be used to improve performance on the target problem. The method produces performance higher than the previous best results on CoNLL’00 syntactic chunking and CoNLL’03 named entity chunking (English and German).
منابع مشابه
Joint Word Segmentation, POS-Tagging and Syntactic Chunking
Chinese chunking has traditionally been solved by assuming gold standard word segmentation. We find that the accuracies drop drastically when automatic segmentation is used. Inspired by the fact that chunking knowledge can potentially improve segmentation, we explore a joint model that performs segmentation, POStagging and chunking simultaneously. In addition, to address the sparsity of full ch...
متن کاملWord Representations: A Simple and General Method for Semi-Supervised Learning
If we take an existing supervised NLP system, a simple and general way to improve accuracy is to use unsupervised word representations as extra word features. We evaluate Brown clusters, Collobert and Weston (2008) embeddings, and HLBL (Mnih & Hinton, 2009) embeddings of words on both NER and chunking. We use near state-of-the-art supervised baselines, and find that each of the three word repre...
متن کاملتعیین مرز و نوع عبارات نحوی در متون فارسی
Text tokenization is the process of tokenizing text to meaningful tokens such as words, phrases, sentences, etc. Tokenization of syntactical phrases named as chunking is an important preprocessing needed in many applications such as machine translation information retrieval, text to speech, etc. In this paper chunking of Farsi texts is done using statistical and learning methods and the grammat...
متن کاملSemi-Supervised Sequential Labeling and Segmentation Using Giga-Word Scale Unlabeled Data
This paper provides evidence that the use of more unlabeled data in semi-supervised learning can improve the performance of Natural Language Processing (NLP) tasks, such as part-of-speech tagging, syntactic chunking, and named entity recognition. We first propose a simple yet powerful semi-supervised discriminative model appropriate for handling large scale unlabeled data. Then, we describe exp...
متن کاملMinimally Supervised Japanese Named Entity Recognition: Resources and Evaluation
Approaches to named entity recognition that rely on hand-crafted rules and/or supervised learning techniques have limitations in terms of their portability into new domains as well as in the robustness over time. For the purpose of overcoming those limitations, this paper evaluates named entity chunking and classi cation techniques in Japanese named entity recognition in the context of minimall...
متن کامل